The relative sensitivity of different alignment methods and character codings in sensitivity analysis
Authors: M.P. Simmons et al.
Abstract
Sensitivity analysis provides a way to measure robustness of clades in sequence-based phylogenetic analyses to variation in alignment parameters rather than measuring their branch support. We compared three different approaches to multiple sequence alignment in the context of sensitivity analysis: progressive pairwise alignment, as implemented in MUSCLE; simultaneous multiple alignment of sequence fragments, as implemented in DCA; and direct optimization followed by generation of the implied alignment(s), as implemented in POY. We set out to determine the relative sensitivity of these three alignment methods using rDNA sequences and randomly generated sequences. A total of 36 parameter sets were used to create the alignments, varying the transition, transversion, and gap costs. Tree searches were performed using four different character-coding and weighting approaches: the cost function used for alignment, or equally weighted parsimony with gap positions treated as missing data, as separate characters, or as fifth states. For the three empirical datasets, POY was found to be as sensitive as, or more sensitive than, DCA and MUSCLE to variation in alignment parameters. For the randomly generated sequences, with sensitivity measured using the averaged jackknife values, POY was found to be more sensitive than MUSCLE, which in turn was found to be as sensitive as, or more sensitive than, DCA. When significant differences in relative sensitivity were found between the different ways of weighting character-state changes, equally weighted parsimony, for all three ways of treating gapped positions, was less sensitive than applying the same cost function used in alignment for phylogenetic analysis.
When branch support is incorporated into the sensitivity criterion, our results favour the use of simultaneous alignment and progressive pairwise alignment using the similarity criterion over direct optimization followed by using the implied alignment(s) to calculate branch support.
© The Willi Hennig Society 2008. Cladistics 24 (2008) 1039–1050. doi: 10.1111/j.1096-0031.2008.00230.x. Corresponding author: Fax +1 970 491 0649.
Sensitivity analysis provides a way to measure robustness of clades in sequence-based phylogenetic analyses to variation in alignment parameters rather than measuring their branch support (Wheeler, 1995; Giribet, 2002, 2003; Goloboff et al., 2003; Giribet and Wheeler, 2007; but see Farris, 2004; Grant and Kluge, 2005), as with the bootstrap (Felsenstein, 1985) and jackknife (Farris et al., 1996) for an alignment created using a single set of alignment parameters. The less sensitive a clade is to variation in alignment parameters the better. For example, Wheeler (1995, p. 328) stated that: “If a high fraction of the total analysis space supports a group, the group is generally supported by the data because most combinations of analytical parameters will yield that clade, especially if the areas of support are contiguous. If, however, the areas in which the clade is supported are broken up and distributed over the space, this group (however general) would be unstable because small perturbations in analysis would lead to a new result”. Giribet (2003, p. 559) asserted that “Stability under different parameters/models may well become the preferred criterion for taxonomic revision”.
In the present study, we used rDNA sequences as well as randomly generated sequences to determine the relative sensitivity (as measured by congruence of trees inferred by the alignments) of different multiple-alignment methods to variation in alignment parameters. We compared the following three approaches to global multiple sequence alignment: progressive pairwise alignment (Feng and Doolittle, 1987), as implemented in MUSCLE (Edgar, 2004a,b); simultaneous multiple alignment of sequence fragments (Tönges et al., 1996), as implemented in DCA (Stoye, 1998); and direct optimization (Wheeler, 1996) followed by generation of implied alignments (Wheeler, 2003), as implemented in POY (Wheeler et al., 2003).
In progressive pairwise alignment, the most similar pair(s) of sequences is aligned first, followed by progressively less similar pairs or sets of sequences until all sequences are aligned in a single multiple alignment. Lake (1991) demonstrated that the order used in progressive pairwise alignment can undesirably determine the topology of the phylogeny inferred from the aligned sequences (see also Thorne et al., 1991; Thorne and Kishino, 1992). Lake (1991) suggested that the solution to this problem would be to restrict phylogenetic inference to using characters from those regions for which the alignment is independent of any particular progressive alignment order. Alternatively, simultaneous alignment may potentially be used to eliminate any biases that may be caused by use of any particular progressive alignment order. The alignment criterion for both progressive pairwise alignment and simultaneous alignment is similarity.
Direct optimization differs from the other methods examined in that its alignment criterion is likelihood or parsimony, rather than similarity (Wheeler, 1996, 2006). The alignment(s) that produces the tree(s) with the highest likelihood or fewest steps is favoured over alternative alignments (but note that direct optimization does not necessarily create any single multiple alignment; Wheeler, 1996). This is accomplished by integrating alignment and phylogenetic tree search into a single step, after which an implied alignment may then be produced.
Note that implied alignments are equivalent to secondary, not primary, homology statements (de Pinna, 1991), in contrast to DCA and MUSCLE alignments.
Our a priori hypothesis on the relative sensitivity of the three alignment methods to variation in alignment parameters (when branch support is incorporated into the sensitivity measure) was that direct optimization would be more sensitive than progressive pairwise alignment, which would be more sensitive than simultaneous alignment. This hypothesis was based on the use of a single order (or in some cases two or more orders for direct optimization when equally optimal alignments are found) in which entire sequences are aligned (in the case of progressive pairwise alignment) or optimized (in the case of direct optimization), in contrast to simultaneous alignment. The use of a single tree (or subset of possible trees) to determine alignment or optimization order produces alignments from which phylogenetic trees are derived that are subject to the artifacts described by Lake (1991).
Direct optimization optimizes sequences at internal nodes. In contrast, progressive pairwise alignment relies on pairwise comparisons between each sequence in one set of sequences being aligned and all sequences in the other set of sequences being aligned (e.g., see Thompson et al., 1994, fig. 2). As such, direct optimization is more effective at minimizing the number of substitutions and indels required to change from one sequence to another on the most parsimonious tree(s). Hence, the implied alignment(s) derived from direct optimization is more closely linked to its associated tree topology (or topologies, in some cases where two or more equally optimal alignments are reported), and its bias in favour of this tree topology will be more pronounced than that of a progressive pairwise alignment that used the same topology to determine the pairwise alignment order.
This comes into play for sensitivity analysis because the tree that is used to guide optimization or pairwise alignment is expected to vary when different costs are assigned to the various alignment parameters. Whichever tree(s) is chosen will set the course for the alignments to be biased in favour of that tree.
This hypothesis is generally consistent with results from previous studies that have conducted sensitivity analyses using different alignment methods. Terry and Whiting (2005), Pons and Vogler (2006), and Sharkey et al. (2006) (see also Shull et al., 2001; Caterino and Vogler, 2002; Ogden and Whiting, 2003) compared progressive pairwise alignments generated by Clustal (Thompson et al., 1994) with direct optimization in POY. Unfortunately, it is impossible to set identical alignment parameters between Clustal and POY (Ogden and Whiting, 2003). With that qualification in mind, Clustal alignments were found to be less sensitive than POY implied alignments, as measured by the incongruence length difference (Mickevich and Farris, 1981) in Terry and Whiting (2005) and by taxonomic congruence (Nelson, 1979) in Pons and Vogler (2006) and Sharkey et al. (2006).
A further difficulty when using taxonomic congruence to compare trees inferred from progressive pairwise alignments with trees generated by direct optimization is that character-state changes are generally weighted differently. Whereas POY implements the “logically consistent” (LC) approach advocated by Wheeler (1994; see also Giribet and Wheeler, 1999; Phillips et al., 2000), wherein the same cost function used in alignment is also used for phylogenetic analysis, both Pons and Vogler (2006) and Sharkey et al. (2006) used equally weighted parsimony to infer trees based on their Clustal alignments. Unlike Wheeler (1994), Simmons and Ochoterena (2000, pp. 369–370; see also Simmons, 2004) considered alignment and tree search to be logically independent of each other, and therefore did not consider it necessary to use the same cost function in alignment and phylogenetic analysis.
M.P. Simmons et al. / Cladistics 24 (2008) 1039–1050
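The sensitivity criterion discussed in this section, in which a clade is judged by the fraction of the analysis space that supports it (Wheeler, 1995), can be illustrated with a minimal Python sketch. It assumes that each parameter set is a (transition cost, transversion cost, gap cost) triple and that the tree obtained under each parameter set has already been reduced to a set of clades. The function name `clade_stability`, the taxon labels, and the three parameter sets are hypothetical and are not taken from the study, which used 36 parameter sets. The sketch also ignores whether the supporting parameter sets are contiguous in the parameter space.

```python
from collections import Counter
from typing import Dict, FrozenSet, Set, Tuple

# Assumed representation: (transition cost, transversion cost, gap cost).
ParamSet = Tuple[int, int, int]
# A clade is represented by the set of taxon labels it contains.
Clade = FrozenSet[str]


def clade_stability(trees: Dict[ParamSet, Set[Clade]]) -> Dict[Clade, float]:
    """Return, for each clade, the fraction of parameter sets whose tree contains it."""
    if not trees:
        raise ValueError("at least one parameter set is required")
    # Each tree is a set, so a clade is counted at most once per parameter set.
    counts = Counter(clade for clades in trees.values() for clade in clades)
    return {clade: n / len(trees) for clade, n in counts.items()}


# Hypothetical rooted trees for taxa A-D under three made-up parameter sets.
example_trees: Dict[ParamSet, Set[Clade]] = {
    (1, 1, 1): {frozenset({"A", "B"}), frozenset({"C", "D"})},
    (1, 2, 2): {frozenset({"A", "B"}), frozenset({"C", "D"})},
    (1, 2, 4): {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
}

stability = clade_stability(example_trees)
for clade, fraction in sorted(stability.items(), key=lambda item: -item[1]):
    print(sorted(clade), round(fraction, 2))
```

In this toy input, the clade containing A and B is recovered under every parameter set, so it would count as insensitive to the alignment parameters, whereas the other clades appear under only some of them.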